Family name distributions: Master equation approach 
Although cumulative family name distributions in many countries exhibit power-law forms, there also exist counterexamples. The origin of the different family name distributions across countries is discussed analytically in the framework of a population dynamics model. Combined with empirical observations, the analysis suggests that these differences are closely related to the rate at which new family names appear.
I. INTRODUCTION

Understanding the structure of a population and how it evolves in time has been a critical issue in modern societies. Having started as an economic problem, it soon extended to an environmental one, and states take censuses periodically in order to use the results to design demographic policies. Malthus was the first to present the mathematical claim, since accepted as the fundamental principle of population dynamics, that "population, when unchecked, increases in a geometrical ratio" [1]. In modern terms, he meant the exponential growth of a population of size $N$ with a constant net growth rate $r$, i.e., $\dot{N}/N = r$ with the time derivative $\dot{N}$; this is often called the Malthusian growth model. Forty years after Malthus published his essay, Verhulst added the idea of the maximal capacity allowed by the environment, $K$, to the growth model, so that the growth rate $r$ can be negative when the population size $N$ exceeds $K$; this is referred to as the logistic model [2]. In addition to the inherent variety of dynamics it may exhibit [3], there are also other models reflecting the complexity in population dynamics, one of which has been introduced and termed the $\theta$-logistic model [4, 5, 6].

Information on the structure of human population is not only available in many countries, but also very reliable owing to modern census techniques. Various classifications therein allow deeper insight into how subpopulations develop and interact with each other. In this work, we classify people according to their family names (or surnames), and study the family name distribution. These studies can also be important from the viewpoint of genetics in biology, since the inheritance of the family name is often paternal, exactly like the inheritance pattern of the Y chromosome. Furthermore, if one can identify quantitatively the origin of differences in family name distributions across countries, it can provide an understanding of the social mechanism behind the naming behavior in human societies.

The pattern of family name distributions in many 



TABLE I: Summary of the empirical results for family name distributions. The family name distribution function is written as $P(k) \sim k^{-\gamma}$ with $k$ the size of the family, i.e., the number of individuals who have the same family name. As the sampled population size $N$ is increased, the number $N_f$ of observed family names increases either logarithmically (China and Korea) or algebraically (other countries), giving us two distinct groups.

Region                   $\gamma$      $N_f$
China [7]                -             $\ln N$
Korea [8]                1.0           $\ln N$
Argentina [9]            -             $N^{0.84}$
Austria [10]             -             $N^{0.83}$
Berlin [11]              2.0           -
France [10]              -             $N^{0.90}$
Germany [10]             -             $N^{0.77}$
Isle of Man [12]         1.5           -
Italy [10]               -             $N^{0.75}$
Japan [13]               1.75          $N^{0.65}$
Netherlands [10]         -             $N^{0.69}$
Norway [22]              2.16          -
Sicily [14]              0.46-1.83     $N^{1\ldots}$
Spain [10]               -             $N^{0.81}$
Switzerland [10]         -             $N^{0.73}$
Taiwan [12]              1.9           -
United States [11, 15]   1.94          -
Venezuela [16]           -             $N^{0.69}$
Vietnam [23]             1.43          $N^{0.27}$

countries have already been investigated (see Table I): In Japan, the family name distribution $P(k)$ has been shown to have a power-law dependence on the size $k$ of families, i.e., $P(k) \sim k^{-\gamma}$ with the exponent $\gamma \approx 1.75$ [13]. Later, families in the United States and Berlin were also reported to display power-law behavior with a similar exponent $\gamma \approx 2.0$ [11]. Power-law distributions with exponents $\gamma \approx 1.9$ and $\gamma \approx 1.5$ have also been measured for Taiwanese family names and for names in the Isle of Man, respectively [12]. Extensive research in various countries ranging from Western Europe to South America has again found exponents around 2 (see Refs. [9, 10, 16] and references therein). In sharp contrast, a recent investigation of the Korean family name distribution revealed the very interesting behavior of $\gamma \approx 1.0$ [8]. The Korean distribution is very different since the cumulative distribution $P_c(k)$ (the number of family names with more than $k$ members, divided by the total number of family names) becomes logarithmic, which results in an exponential Zipf plot (sizes versus ranks of families) [8]. Throughout the present paper, the rank of a family is defined according to its size in descending order, i.e., the biggest family is assigned rank 1, the second biggest family has rank 2, and so on. More strikingly, the exponentially decaying Zipf plot suggests that the number of family names $N_f$ found in a population of size $N$ increases logarithmically in Korea (we observe the same behavior for the Chinese family names reported in Ref. [7]), in sharp contrast to the corresponding results for other countries, where $N_f$ grows algebraically with $N$.
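The two quantities defined above, the cumulative distribution $P_c(k)$ and the rank-size (Zipf) ordering, can be computed from a raw list of surnames along the following lines. This is only a minimal sketch: the function names and the toy sample are hypothetical and are not taken from any of the cited data sets.

```python
from collections import Counter

def family_sizes(surnames):
    """Return the list of family sizes, one entry per distinct surname."""
    return list(Counter(surnames).values())

def cumulative_distribution(sizes):
    """P_c(k): fraction of family names with more than k members."""
    n_names = len(sizes)
    return {k: sum(1 for s in sizes if s > k) / n_names
            for k in sorted(set(sizes))}

def zipf_ranking(sizes):
    """(rank, size) pairs; rank 1 is assigned to the biggest family."""
    return list(enumerate(sorted(sizes, reverse=True), start=1))

# Toy sample (hypothetical surnames, for illustration only)
sample = ["Kim"] * 5 + ["Lee"] * 3 + ["Park"] * 2 + ["Choi"]
sizes = family_sizes(sample)
print(cumulative_distribution(sizes))
print(zipf_ranking(sizes))
```

Plotting the size against the rank on a semi-logarithmic scale would then show whether the Zipf plot decays exponentially, as reported for Korea, or algebraically.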

In this work, we investigate a possible mechanism for the differences in family name distributions across countries by using a simple model of population dynamics. We suggest that the difference originates from the rate of appearance of new family names, which we check against empirical observations of the history of Korean family names. In more detail, if new names appear linearly in time irrespective of the total population size, $\gamma = 1$ is obtained, whereas if the number of new names generated per unit of time is proportional to the population size, $\gamma \approx 2$ results. We also investigate family books containing genealogical trees for several family names in Korea, and extract the family name distribution to construct the Zipf plot, revealing that the exponential Zipf plot in Korea has been prevalent for at least 500 years. Family names in other countries such as China, Vietnam, and Norway are newly investigated, and comparisons with existing studies lead us to the conclusion that there are indeed two distinct groups of family name distributions.

The present paper is organized as follows: We present our master equation formulation and obtain the formal solution for the distribution function in Sec. II. A detailed analysis is then made in Sec. III for the case of a constant name generation rate, and historical observations are also discussed. Section IV is devoted to the other case, in which branching out from old to new names is allowed, and is followed by a summary in Sec. V.



II. POPULATION DYNAMICS: FORMULATION 

We first introduce the master equation in a general 
form to describe the time evolution of the family size, 
and then present the formal solution obtained by using 
the generating function technique. 

Let us define the probability $P_{j,k}(s,t)$ for a class (family) to have $n(t) = k$ members at time $t$, given that it started with $n(s) = j$ members at time $s$:

$$ P_{j,k}(s,t) = P[n(t) = k \mid n(s) = j], \tag{1} $$

which is required to satisfy the initial condition $P_{j,k}(s,s) = \delta_{jk}$ with the Kronecker delta $\delta_{jk} = 1\,(0)$ if $j = k$ ($j \neq k$). The time evolution of $P_{j,k}(s,t)$ is governed by the following master equation:

$$ \frac{dP_{j,k}(s,t)}{dt} = \lambda_{k-1}(t)\,P_{j,k-1}(s,t) + [\mu_{k+1}(t) + \beta_{k+1}(t)]\,P_{j,k+1}(s,t) - [\lambda_k(t) + \mu_k(t) + \beta_k(t)]\,P_{j,k}(s,t), \tag{2} $$



where we have made the continuous-time approximation $P_{j,k}(s,t+1) - P_{j,k}(s,t) \approx dP_{j,k}(s,t)/dt$. For convenience, we take one year as the time unit, and thus the rate variables $\lambda$, $\mu$, and $\beta$ are defined in terms of the annual change of population. The first term on the right-hand side of Eq. (2) describes the process in which a class with $k-1$ members gains one member, which occurs at the birth rate $\lambda_{k-1}(t)$. The second term describes the opposite process, in which a class with $k+1$ members shrinks to $k$ members; this occurs when one member either dies at the death rate $\mu_{k+1}(t)$ or invents a new family name at the branching rate $\beta_{k+1}(t)$. We consider only the branching process in which a person invents a new name; changing one's name to an already existing one is not allowed in our model. The last term accounts for the change from $k$ to either $k+1$ or $k-1$, which occurs when a person is born, dies, or changes name, at rates $\lambda_k(t)$, $\mu_k(t)$, and $\beta_k(t)$, respectively. In this work, we allow the birth, death, and branching rates to depend on time, and write them as

$$ \lambda_k(t) = k\lambda\phi(t), \quad \mu_k(t) = k\mu\phi(t), \quad \beta_k(t) = k\beta\phi(t). \tag{3} $$

The prefactor $k$ is easily understood since a family with $k$ members has a chance proportional to $k$ of being picked. We henceforth also assume that $\lambda > \mu + \beta$ to describe a population growing in time.
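The birth, death, and branching events described above can be illustrated with a small stochastic simulation of a single family. The sketch below assumes $\phi(t) = 1$ and uses a standard Gillespie-type algorithm, which is our choice for illustration and not a method used in the paper; the parameter values and function names are hypothetical. The simulated mean size is compared with $e^{(\lambda-\mu-\beta)t}$, the expected family size given in Eq. (7).

```python
import math
import random

def simulate_family(lam, mu, beta, t_end, rng):
    """Gillespie simulation of one family starting from a single member.

    Each member gives birth at rate lam, dies at rate mu, and leaves the
    family by inventing a new name at rate beta (phi(t) = 1 is assumed).
    Returns the family size at time t_end.
    """
    k, t = 1, 0.0
    per_capita = lam + mu + beta
    while k > 0:
        t += rng.expovariate(k * per_capita)  # waiting time to the next event
        if t > t_end:
            break
        if rng.random() < lam / per_capita:
            k += 1                            # birth
        else:
            k -= 1                            # death or branching
    return k

def mean_family_size(lam, mu, beta, t_end, n_runs=20000, seed=1):
    """Average family size at t_end over n_runs independent families."""
    rng = random.Random(seed)
    total = sum(simulate_family(lam, mu, beta, t_end, rng) for _ in range(n_runs))
    return total / n_runs

# Illustrative parameter values, not taken from the paper
lam, mu, beta, t_end = 1.0, 0.4, 0.1, 2.0
simulated = mean_family_size(lam, mu, beta, t_end)
expected = math.exp((lam - mu - beta) * t_end)
print(simulated, expected)
```

The two printed numbers should agree up to statistical fluctuations, which shrink as `n_runs` grows.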

The solution of the master equation (2) is easily found by using the generating function written as (see, e.g., Ref. [17])

$$ \Psi_{j,s}(z,t) = \sum_{k=0}^{\infty} z^k P_{j,k}(s,t), \tag{4} $$

with the initial condition $\Psi_{j,s}(z,s) = z^j$ [see Eq. (1)]. It is straightforward to get the following partial differential equation for $\Psi$ by combining Eqs. (2) and (4):

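$$ \frac{1}{\lambda\phi}\frac{\partial\Psi}{\partial t} = (z-1)(z-\bar{\mu})\frac{\partial\Psi}{\partial z}, $$

with $\bar{\mu} \equiv (\mu+\beta)/\lambda < 1$, which is written in the simpler form

$$ \frac{\partial\Psi}{\partial\tau} = \frac{\partial\Psi}{\partial x} $$

by introducing new variables $x$ and $\tau$ as

$$ d\tau = \lambda\phi\,dt, \qquad dx = \frac{dz}{(z-1)(z-\bar{\mu})}. $$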



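The solution should be written as $\Psi(x,\tau) = g(x+\tau)$, and the functional form of $g(x)$ is determined from the initial condition $\Psi_{j,s}(z,s) = z^j$ (we henceforth set $j = 1$, i.e., the class started from only one member):

$$ \Psi = 1 - \left[ \left( \frac{1}{1-z} - \frac{1}{1-\bar{\mu}} \right) e^{-(1-\bar{\mu})(\tau-\sigma)} + \frac{1}{1-\bar{\mu}} \right]^{-1}, $$

where $\tau - \sigma = \int_s^t \lambda\phi(t')\,dt'$. By expanding the generating function, we reach the desired probability distribution

$$ P_{1,k}(s,t) = \begin{cases} 1 - (1-\bar{\mu})(1-\bar{\mu}R)^{-1} & \text{for } k = 0, \\ \eta^{k-1}(1-\eta)(1-\bar{\mu}\eta) & \text{for } k > 0, \end{cases} $$

where

$$ \eta = \frac{1 - e^{-(1-\bar{\mu})(\tau-\sigma)}}{1 - \bar{\mu}\,e^{-(1-\bar{\mu})(\tau-\sigma)}} = \frac{1-R}{1-\bar{\mu}R}, \tag{5} $$

with $R \equiv e^{-(1-\bar{\mu})(\tau-\sigma)}$.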
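As a consistency check of the closed-form distribution above, one may integrate the master equation (2) numerically and compare. The following is a minimal sketch under stated assumptions: $\phi(t) = 1$, $s = 0$, a simple explicit Euler scheme, and a truncation of the state space at a hypothetical `k_max`; the parameter values are illustrative only.

```python
import math

def closed_form(k_max, lam, mu, beta, t):
    """P_{1,k}(0,t) for k = 0..k_max from Eq. (5), assuming phi(t) = 1."""
    mu_bar = (mu + beta) / lam
    R = math.exp(-(lam - mu - beta) * t)
    eta = (1.0 - R) / (1.0 - mu_bar * R)
    p = [1.0 - (1.0 - mu_bar) / (1.0 - mu_bar * R)]
    p += [eta ** (k - 1) * (1.0 - eta) * (1.0 - mu_bar * eta)
          for k in range(1, k_max + 1)]
    return p

def integrate_master_equation(k_max, lam, mu, beta, t, steps=10000):
    """Explicit Euler integration of Eq. (2), truncated at k_max."""
    p = [0.0] * (k_max + 2)   # one padding entry so that p[k + 1] always exists
    p[1] = 1.0                # the family starts with a single member
    dt = t / steps
    nu = mu + beta            # total per-capita loss rate (death or branching)
    for _ in range(steps):
        new = p[:]
        for k in range(k_max + 1):
            gain = nu * (k + 1) * p[k + 1]
            if k >= 1:
                gain += lam * (k - 1) * p[k - 1]
            loss = (lam + nu) * k * p[k]
            new[k] = p[k] + dt * (gain - loss)
        p = new
    return p[:k_max + 1]

# Illustrative parameter values, not taken from the paper
lam, mu, beta, t = 1.0, 0.4, 0.1, 1.0
exact = closed_form(40, lam, mu, beta, t)
numeric = integrate_master_equation(40, lam, mu, beta, t)
max_diff = max(abs(a - b) for a, b in zip(exact, numeric))
print(max_diff)
```

The closed form is also normalized and its mean reproduces $e^{(\lambda-\mu-\beta)t}$, which can be checked by summing `exact` and `k * exact[k]`.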

So far we have focused on the size of a class first introduced at a certain time $s\,(<t)$. To derive the overall population distribution observed at time $t$, one needs to know when each class was introduced. Let $\Pi(s)$ represent the rate at which a class is introduced. If the history begins at $t = 0$, the resulting population distribution at time $t$ is given by

$$ P(k,t) = \frac{\int_0^t P_{1,k}(s,t)\,\Pi(s)\,ds}{\int_0^t \Pi(s)\,ds}, \tag{6} $$

where $N_f(t) = \int_0^t \Pi(s)\,ds$ is the total number of family names at time $t$.

Although one can think of a further generalization using different time-dependent functions $\phi_\lambda(t)$, $\phi_\mu(t)$, and $\phi_\beta(t)$ for the corresponding rate variables in Eq. (3), for simplicity we restrict ourselves to the identical form $\phi(t)$. Within this limitation, it is noteworthy that our expression for the family name distribution in Eq. (6) applies to a variety of situations with arbitrary $\Pi(s)$ and $\phi(t)$. For example, the widely used Simon model [11, 18, 19, 20] corresponds to the situation in which $\Pi(s) = \text{const}$ and $\phi(t) \propto 1/N(t)$ with the total population $N(t)$. It is to be noted that the use of $\phi(t) \propto 1/N(t)$ introduces an effective competition among individuals: In one unit of time, only some fixed number of individuals are allowed to be born or die, which yields a linear increase of population in time, different from what really happened in human history. Accordingly, we focus below on the case $\phi(t) = 1$ to have exponential growth of the population; however, we consider different choices for $\Pi(s)$.

The assumption of time-independent rates with $\phi(t) = 1$ [see Eq. (3)] results in $(1-\bar{\mu})(\tau-\sigma) = (\lambda-\mu-\beta)(t-s)$.



Without knowing the details of the generation mechanism of new family names, it is plausible to assume that new family names are introduced into the population at the rate

$$ \Pi(s) = \alpha + \beta N(s), $$

which contains both a population-independent part ($\alpha$) and a population-proportional part ($\beta N$). The second term $\beta N(s)$ is easily motivated if we assume that each individual invents a new family name with a given probability $\beta$ per unit time. The population-independent part of the name generation rate should also be included to describe, e.g., immigration from abroad.

Let us consider a family that started at time $s$. The expected family size at time $t$ is computed to be [see Eq. (4)]

$$ \bar{k}(s,t) = \sum_k k\,P_{1,k}(s,t) = \left.\frac{\partial\Psi}{\partial z}\right|_{z=1} = e^{(\lambda-\mu-\beta)(t-s)}, \tag{7} $$

which yields the self-consistent integral equation

$$ N(t) = \int_0^t \bar{k}(s,t)\,\Pi(s)\,ds = \int_0^t e^{(\lambda-\mu-\beta)(t-s)}\,[\alpha + \beta N(s)]\,ds, $$

or, in the differential form,

$$ \frac{dN(t)}{dt} = \alpha + (\lambda-\mu)N(t), $$

with the solution

$$ N(t) = N(0)\,e^{(\lambda-\mu)t} + \frac{\alpha\,(e^{(\lambda-\mu)t} - 1)}{\lambda-\mu} \sim e^{(\lambda-\mu)t}. \tag{8} $$

As is expected, the population-proportional part $\beta N$ in $\Pi(s)$, which is due to the change of names (from an existing one to a new one), has nothing to do with the increase or decrease of the total population, and thus only the population-independent part of $\Pi(s)$ enters $N(t)$.
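The step from the integral equation for $N(t)$ to its differential form can be made explicit with the Leibniz rule. The following is a sketch of that intermediate step, using $\bar{k}(t,t) = 1$ and $\partial\bar{k}(s,t)/\partial t = (\lambda-\mu-\beta)\,\bar{k}(s,t)$, both of which follow from Eq. (7):

```latex
\begin{align*}
\frac{dN(t)}{dt}
  &= \bar{k}(t,t)\,\Pi(t)
     + \int_0^t \frac{\partial \bar{k}(s,t)}{\partial t}\,\Pi(s)\,ds \\
  &= \alpha + \beta N(t)
     + (\lambda-\mu-\beta)\int_0^t \bar{k}(s,t)\,\Pi(s)\,ds \\
  &= \alpha + \beta N(t) + (\lambda-\mu-\beta)\,N(t)
   = \alpha + (\lambda-\mu)\,N(t).
\end{align*}
```

The two terms proportional to $\beta$ cancel, which is the formal counterpart of the statement that renaming moves individuals between families without changing the total population.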



III. CONSTANT NAME GENERATION RATE 

When family names appear uniformly in time and no branching process occurs, i.e.,

$$ \Pi(s) = \alpha, $$

we obtain, via a change of the integration variable from $s$ to $\eta$ [see Eqs. (5) and (6)],

$$ P(k,t) = \frac{1}{\lambda t}\int_0^{\eta(s=0)} \eta^{k-1}\,d\eta = \frac{1}{\lambda t k}\left[\frac{1 - e^{-(\lambda-\mu)t}}{1 - \bar{\mu}\,e^{-(\lambda-\mu)t}}\right]^k, $$

which yields

$$ P(k, t\to\infty) \sim k^{-\gamma} \quad \text{with } \gamma = 1. $$

It is also straightforward to get the number of family names

$$ N_f(t) = \int_0^t \Pi(s)\,ds = \int_0^t \alpha\,ds \sim t, \tag{9} $$

which, combined with the total population $N(t) \sim e^{(\lambda-\mu)t}$, yields the expression $N_f(t) \sim \ln N(t)$.
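The closed form for $P(k,t)$ with $\Pi(s) = \alpha$ can be checked against a direct numerical evaluation of Eq. (6). The sketch below assumes $\beta = 0$ and $\phi(t) = 1$ and uses a plain midpoint rule; the function names and parameter values are hypothetical and serve only as an illustration.

```python
import math

def p_k_closed(k, lam, mu, t):
    """P(k,t) for Pi(s) = alpha and beta = 0: eta0**k / (lam * t * k)."""
    mu_bar = mu / lam
    decay = math.exp(-(lam - mu) * t)
    eta0 = (1.0 - decay) / (1.0 - mu_bar * decay)   # value of eta at s = 0
    return eta0 ** k / (lam * t * k)

def p_k_quadrature(k, lam, mu, t, n=20000):
    """Midpoint-rule evaluation of Eq. (6) with Pi(s) = alpha (alpha cancels)."""
    mu_bar = mu / lam
    ds = t / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * ds
        R = math.exp(-(lam - mu) * (t - s))
        eta = (1.0 - R) / (1.0 - mu_bar * R)
        total += eta ** (k - 1) * (1.0 - eta) * (1.0 - mu_bar * eta) * ds
    return total / t

# Illustrative parameter values, not taken from the paper
lam, mu, t = 1.0, 0.5, 10.0
for k in (1, 2, 5, 10):
    print(k, p_k_closed(k, lam, mu, t), p_k_quadrature(k, lam, mu, t))
```

For large $t$ the bracketed factor approaches unity, so that $k\,P(k,t)$ becomes nearly independent of $k$ over a wide range, which is the $\gamma = 1$ behavior.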



Interestingly, the above results are in perfect agreement with what has been found for the Korean family name distribution [8]. The cumulative family name distribution becomes logarithmic (i.e., $\gamma = 1$), which gives an exponentially decaying Zipf plot where the size is displayed as a function of the rank of the family. Furthermore, one can show directly from $P(k) \sim k^{-1}$ that the number of family names $N_f$ found for the population size $N$ increases logarithmically, i.e., $N_f \sim \ln N$ [8].
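The link between a constant name generation rate, an exponential Zipf plot, and $N_f \sim \ln N$ can be made concrete with a deliberately simplified picture. The sketch below is a mean-field assumption of ours, not the stochastic model of the paper: one family is created every $1/\alpha$ years and each family grows deterministically as $e^{(\lambda-\mu)\,\text{age}}$, so fluctuations and extinction are ignored. All names and parameter values are hypothetical.

```python
import math

def mean_field_sizes(alpha, growth, t):
    """Sizes of families created at times i/alpha, each grown as exp(growth * age).

    The list is ordered from the oldest (largest) family to the youngest,
    so the index plus one is the Zipf rank.
    """
    n_names = round(alpha * t)
    birth_times = [i / alpha for i in range(n_names)]
    return [math.exp(growth * (t - s)) for s in birth_times]

def names_and_population(alpha, growth, t):
    """Return (N_f, N) at time t in the mean-field picture."""
    sizes = mean_field_sizes(alpha, growth, t)
    return len(sizes), sum(sizes)

# Illustrative values: 0.2 new names per year, 1% net growth per year
alpha, growth = 0.2, 0.01
sizes = mean_field_sizes(alpha, growth, 1000.0)

# Exponential Zipf plot: log(size) drops by the same amount from rank to rank
log_steps = [math.log(sizes[r]) - math.log(sizes[r + 1])
             for r in range(len(sizes) - 1)]

# N_f versus ln N: the ratio of increments approaches alpha / growth
(nf1, n1) = names_and_population(alpha, growth, 1000.0)
(nf2, n2) = names_and_population(alpha, growth, 2000.0)
slope = (nf2 - nf1) / (math.log(n2) - math.log(n1))
print(log_steps[0], slope, alpha / growth)
```

In this picture the family of rank $r$ has size proportional to $e^{-(\lambda-\mu)r/\alpha}$, and the number of names grows as $(\alpha/(\lambda-\mu))\ln N$ up to an additive constant.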

The assumption of a constant rate $\Pi(s)$ of new name generation is very plausible in Korean culture: Inventing a new family name is more or less considered taboo, and in Korean history very few names have been introduced; there were only 288 family names in the year 2000 [8]. Furthermore, only 11 names newly appeared between 1985 and 2000, which seems to imply that $\beta$ in Korea is extremely close to zero. (If we assume that $\alpha = 0$, corresponding to the upper-limit estimate of $\beta$, we get $\beta \approx 1.8 \times 10^{-8}$ per year.) Korea has preserved its family name system for more than two millennia, and many Korean families still keep their own genealogical trees, from which their origins can be rather well dated. In order to check the validity of the assumption of a constant generation rate of names, we collect information about the origins and sizes of family names from publicly accessible sources. Collecting the sizes and times of appearance for 178 family names, around 60% of those existing, we obtain Fig. 1. In Fig. 1(a), we show the present size of each family versus the time when it first appeared; it seems to be in accord with the exponential growth in Eq. (7). In Fig. 1(b), we plot the number of family names as a function of time. Although we have included only 178 names, the plot is again in agreement with the linear increase of $N_f(t)$ in Eq. (9) over a broad range of time. We emphasize here that the number of Korean family names increases much more slowly than the total population. We also use several family books containing genealogical trees [21]. Although these books contain only the paternal part of the trees, the family names of women who married members of the family were recorded (in Korea, women do not change their family names after marriage). We use the information about family names of women at various periods of time to plot Fig. 2. It is clearly seen that the size of the family versus the family rank decays exponentially for a broad range of periods, which confirms that the exponential Zipf plot in Korea has been prevalent for a long time and is not a recent trend. We have shown above that a family name distribution of the form $P(k) \sim k^{-1}$ is closely related to the constant generation rate of new names, i.e., $\Pi(s) = \alpha$, which has also been validated from

FIG. 1: (a) Size of each Korean family versus its time of appearance. The times are collected from the genealogical trees of 178 family names, and the family sizes are from the governmental census data of 1985. The family size grows exponentially with the time elapsed since its first appearance. (b) Number of family names, $N_f$, as a function of time $t$ in units of years. In Korea, $N_f$ grows approximately linearly in time while the population growth is exponential.

FIG. 2: (Color online) Korean family size $n(r)$ versus the rank $r$ of the family (Zipf plot) extracted from the family names of married women in family books. The exponential decay has been valid for at least 500 years in Korean history.

FIG. 3: (Color online) (a) Size distribution of Chinese family names, arranged by their ranks $r$ (the Zipf plot). The exponential shape has been maintained from the time of the Song Dynasty (960-1279). (b) Number of people ($N$) versus the number of family names found therein ($N_f$), collected in each province of China, showing clearly $N_f \sim \ln N$.



empirical historical observations. 

As another example, we present the result of our analysis of the Chinese family names in Ref. [7], where 542262 Chinese are sampled with 1042 family names found. Although only the top 100 Chinese family names are available in Ref. [7], the rank-size distribution (Zipf plot) appears to have preserved an exponential tail for almost a millennium, as shown in Fig. 3(a). Moreover, the number of family names increases logarithmically with the number of people, as depicted in Fig. 3(b), supporting our argument.

We next pursue the answer to the question of how the 
distribution changes if new names are produced at a rate 
that is not fixed but grows with the population size. 



IV. FAMILY NAME DISTRIBUTION WITH 
BRANCHING PROCESS ALLOWED 



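If family names are allowed to branch out, the exponent $\gamma$ is altered. With $\beta$ being positive, $\Pi(s)$ is dominated by the exponential growth in the long run [see Eq. (8)],

$$ \Pi(s) = \alpha + \beta N(s) \sim e^{(\lambda-\mu)s}, $$

which we use to compute $P(k,t)$ in Eq. (6) as follows:

$$ P(k, t\to\infty) \propto \int_0^1 \eta^{k-1}\,e^{(\lambda-\mu)s}\,d\eta \propto k^{-\{1 + (\lambda-\mu)/(\lambda-\mu-\beta)\}}, $$

where $s$ is regarded as a function of $\eta$ through Eq. (5), so that $e^{(\lambda-\mu)s} \propto [(1-\eta)/(1-\bar{\mu}\eta)]^{(\lambda-\mu)/(\lambda-\mu-\beta)}$.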

Consequently, the family name distribution in the case of $\beta > 0$ has

$$ \gamma = 2 + \frac{\beta}{\lambda-\mu-\beta}, $$

in agreement with Ref. [12]. It is very reasonable to assume that $\beta$ is much smaller than $\lambda - \mu$, and we expect $\gamma \approx 2$ in most countries. Indeed, the United States and Berlin have $\gamma \approx 2.0$ [11], which, by using the relation $N_f \sim N^{\gamma-1}$ discussed in Ref. [8], leads to the conclusion that $N_f$ and $N$ are proportional to each other. Of course, one can confirm this linear relation from the direct calculation of the number of family names:



$$ N_f(t) = N_f(0) + \alpha\left[1 - \frac{\beta}{\lambda-\mu}\right]t + \frac{\beta}{\lambda-\mu}\,[N(t) - N(0)], $$

which confirms that $N_f/N \to \beta/(\lambda-\mu)$ as $t \to \infty$. The above result of $\gamma = 2 + \beta/(\lambda-\mu-\beta)$ should be used carefully when $\beta \to 0$: If $\beta$ is strictly zero, one cannot use the assumption $\Pi(s) \sim e^{(\lambda-\mu)s}$, and we recover the result $\gamma = 1$ as previously shown.
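The exponent $\gamma = 2 + \beta/(\lambda-\mu-\beta)$ obtained above can be checked numerically by evaluating the integral over $\eta$ for two values of $k$ and measuring the log-log slope. The sketch below assumes that the integrand is $\eta^{k-1}[(1-\eta)/(1-\bar{\mu}\eta)]^{a}$ with $a = (\lambda-\mu)/(\lambda-\mu-\beta)$, which is what the change of variables in Eq. (5) gives; the midpoint rule, the function names, and the parameter values are illustrative choices.

```python
import math

def name_integral(k, a, mu_bar, n=200000):
    """Midpoint rule for the integral over [0, 1] of
    eta**(k-1) * ((1 - eta) / (1 - mu_bar * eta))**a."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        eta = (i + 0.5) * h
        total += eta ** (k - 1) * ((1.0 - eta) / (1.0 - mu_bar * eta)) ** a
    return total * h

def local_exponent(k1, k2, a, mu_bar):
    """Estimate gamma from the log-log slope of the integral between k1 and k2."""
    i1 = name_integral(k1, a, mu_bar)
    i2 = name_integral(k2, a, mu_bar)
    return -(math.log(i2) - math.log(i1)) / (math.log(k2) - math.log(k1))

# Illustrative parameter values, not taken from the paper
lam, mu, beta = 1.0, 0.5, 0.05
a = (lam - mu) / (lam - mu - beta)
mu_bar = (mu + beta) / lam
gamma_numeric = local_exponent(200, 400, a, mu_bar)
gamma_formula = 2.0 + beta / (lam - mu - beta)
print(gamma_numeric, gamma_formula)
```

The numerical slope lies slightly below the asymptotic value at finite $k$ and approaches it as $k$ increases, since the power law holds only for large family sizes.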

From publicly accessible population information, we estimate that the Swedish population increased at the rate $\lambda - \mu \approx 0.456\%$ per year during 2004-2006. In the same period of time, about 100 new family names were introduced per month, which gives us a rough estimate $\beta \approx 0.015\%$ per year. Accordingly, $\beta/(\lambda-\mu) \approx 0.03$, which makes the assumption we made above very plausible. The number of family names in Sweden is known to be somewhere between 140 000 and 400 000, depending on how we count them. Together with the total population of about 9 million, we confirm that $0.016 < N_f/N < 0.044$, which is in accord with our expectation that $N_f/N = \beta/(\lambda-\mu) \approx 0.03$.
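The arithmetic behind the Swedish estimate can be reproduced in a few lines. The input numbers below are the ones quoted in the text; the variable names are ours, and the value of $\gamma$ is simply what the formula $\gamma = 2 + \beta/(\lambda-\mu-\beta)$ gives for these inputs, not a measured exponent.

```python
# Inputs quoted in the text (per year where applicable)
growth = 0.456e-2          # lambda - mu
beta = 0.015e-2            # branching rate
population = 9.0e6         # total population of Sweden (approximate)
names_low, names_high = 140_000, 400_000

ratio_predicted = beta / growth             # expected N_f / N at long times
ratio_low = names_low / population          # lower empirical bound on N_f / N
ratio_high = names_high / population        # upper empirical bound on N_f / N
gamma = 2.0 + beta / (growth - beta)        # exponent implied by the model

print(round(ratio_predicted, 3), round(ratio_low, 3), round(ratio_high, 3))
print(round(gamma, 3))
```

The predicted ratio falls between the two empirical bounds, and the implied exponent stays within a few percent of 2.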

The empirical findings we have referred to are listed in Table I, from which we suggest two main categories of family name systems: one with $\gamma \approx 1$ and a logarithmic increase of $N_f$ versus $N$ (Korea and China), and the other with $\gamma > 1$ and a power-law increase of $N_f$ (other countries). The latter category has been prevalent in the literature, to which we also add Norway with $\gamma \approx 2.16$ (Fig. 4) and Vietnam with $\gamma \approx 1.43$ (Fig. 5).

We suggest above that the existence of two groups of 
family name distributions originates from the difference 
in new name generation rates, which reflects the exis- 
tence of a very different social dynamics behind the nam- 
ing behaviors across different cultures. We also point 

FIG. 4: Cumulative distribution $P_c(k)$ versus the family size $k$ of Norwegian family names, based on the survey in 2007 [22]. The power-law behavior is clearly seen.

FIG. 5: (a) Vietnamese family name cumulative distribution, from the phone book of Ho-Chi-Minh City, 2004. (b) The number of family names $N_f$ found in certain numbers of people $N$. Both show power-law behaviors.



other interesting case is the Japanese system. Again, a Japanese family name rarely undergoes the branching process these days, but one finds the algebraic dependence $N_f \sim N^{0.65}$ [11], which reflects the fact that many Japanese people had to adopt their family names by governmental policy about a century ago. The diversity ensured at the creation appears to have been maintained up to now, characterizing the Japanese family name system. Consequently, the Japanese name distribution cannot be successfully explained by our model, in which the limit $t \to \infty$ is taken. Another peculiar observation has been found in Sicily: The surname distribution from one of its communes shows $\gamma \approx 0.46$, possibly originating from the effects of isolation [14]; a mathematical treatment of this population has not been carried out. Despite these limitations of our model study, in which various simplifications are made implicitly and explicitly, we believe that such an idealization helps one to check the reality and to identify the most important ingredient among many, providing a deeper understanding and insight.



V. SUMMARY 

In summary, we analytically investigated the generating mechanism of observed family name distributions. Whereas the traditional approaches based on the Simon model rest on implicit assumptions about competition within the population, we instead started from the first principle of population dynamics, the Malthusian growth model. The existence of branching processes in generating new family names was pointed out as the crucial factor in determining the power-law exponent $\gamma$: With and without the branching process, $\gamma \approx 2$ and $\gamma = 1$, respectively, were obtained. Genealogical trees collected for Korean family names were analyzed to confirm that the total number of names increased linearly in time, justifying the assumption made in the analytic study. We additionally reported Chinese, Vietnamese, and Norwegian data sets to examine our argument, which, combined with existing studies, lead us to the conclusion that there are two groups of family name distributions across the globe and that these differences can be explained in terms of the differences in new name generation rates.